Group Members:

1 Introduction, Definitions

This project is an example of using customer segmentation to define a marketing strategy. In this sample data set which summarizes the usage behavior of approximately 9000 active credit card holders in the last 6 months, our aim is to perform this segmentation in the most accurate way.

2 Import Data

##   CUST_ID    BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
## 1  C10001   40.90075          0.818182     95.40             0.00
## 2  C10002 3202.46742          0.909091      0.00             0.00
## 3  C10003 2495.14886          1.000000    773.17           773.17
## 4  C10004 1666.67054          0.636364   1499.00          1499.00
## 5  C10005  817.71434          1.000000     16.00            16.00
## 6  C10006 1809.82875          1.000000   1333.28             0.00
##   INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## 1                  95.40        0.000            0.166667
## 2                   0.00     6442.945            0.000000
## 3                   0.00        0.000            1.000000
## 4                   0.00      205.788            0.083333
## 5                   0.00        0.000            0.083333
## 6                1333.28        0.000            0.666667
##   ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 1                   0.000000                         0.083333
## 2                   0.000000                         0.000000
## 3                   1.000000                         0.000000
## 4                   0.083333                         0.000000
## 5                   0.083333                         0.000000
## 6                   0.000000                         0.583333
##   CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT
## 1               0.000000                0             2         1000
## 2               0.250000                4             0         7000
## 3               0.000000                0            12         7500
## 4               0.083333                1             1         7500
## 5               0.000000                0             1         1200
## 6               0.000000                0             8         1800
##    PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
## 1  201.8021         139.5098         0.000000     12
## 2 4103.0326        1072.3402         0.222222     12
## 3  622.0667         627.2848         0.000000     12
## 4    0.0000               NA         0.000000     12
## 5  678.3348         244.7912         0.000000     12
## 6 1400.0578        2407.2460         0.000000     12

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

3 EDA and Cleaning

## 'data.frame':    8950 obs. of  18 variables:
##  $ CUST_ID                         : chr  "C10001" "C10002" "C10003" "C10004" ...
##  $ BALANCE                         : num  40.9 3202.5 2495.1 1666.7 817.7 ...
##  $ BALANCE_FREQUENCY               : num  0.818 0.909 1 0.636 1 ...
##  $ PURCHASES                       : num  95.4 0 773.2 1499 16 ...
##  $ ONEOFF_PURCHASES                : num  0 0 773 1499 16 ...
##  $ INSTALLMENTS_PURCHASES          : num  95.4 0 0 0 0 ...
##  $ CASH_ADVANCE                    : num  0 6443 0 206 0 ...
##  $ PURCHASES_FREQUENCY             : num  0.1667 0 1 0.0833 0.0833 ...
##  $ ONEOFF_PURCHASES_FREQUENCY      : num  0 0 1 0.0833 0.0833 ...
##  $ PURCHASES_INSTALLMENTS_FREQUENCY: num  0.0833 0 0 0 0 ...
##  $ CASH_ADVANCE_FREQUENCY          : num  0 0.25 0 0.0833 0 ...
##  $ CASH_ADVANCE_TRX                : int  0 4 0 1 0 0 0 0 0 0 ...
##  $ PURCHASES_TRX                   : int  2 0 12 1 1 8 64 12 5 3 ...
##  $ CREDIT_LIMIT                    : num  1000 7000 7500 7500 1200 1800 13500 2300 7000 11000 ...
##  $ PAYMENTS                        : num  202 4103 622 0 678 ...
##  $ MINIMUM_PAYMENTS                : num  140 1072 627 NA 245 ...
##  $ PRC_FULL_PAYMENT                : num  0 0.222 0 0 0 ...
##  $ TENURE                          : int  12 12 12 12 12 12 12 12 12 12 ...

Hmm, all variables are numeric other than CUST_ID which will be removed for clustering.

##    CUST_ID             BALANCE        BALANCE_FREQUENCY   PURCHASES       
##  Length:8950        Min.   :    0.0   Min.   :0.0000    Min.   :    0.00  
##  Class :character   1st Qu.:  128.3   1st Qu.:0.8889    1st Qu.:   39.63  
##  Mode  :character   Median :  873.4   Median :1.0000    Median :  361.28  
##                     Mean   : 1564.5   Mean   :0.8773    Mean   : 1003.20  
##                     3rd Qu.: 2054.1   3rd Qu.:1.0000    3rd Qu.: 1110.13  
##                     Max.   :19043.1   Max.   :1.0000    Max.   :49039.57  
##                                                                           
##  ONEOFF_PURCHASES  INSTALLMENTS_PURCHASES  CASH_ADVANCE    
##  Min.   :    0.0   Min.   :    0.0        Min.   :    0.0  
##  1st Qu.:    0.0   1st Qu.:    0.0        1st Qu.:    0.0  
##  Median :   38.0   Median :   89.0        Median :    0.0  
##  Mean   :  592.4   Mean   :  411.1        Mean   :  978.9  
##  3rd Qu.:  577.4   3rd Qu.:  468.6        3rd Qu.: 1113.8  
##  Max.   :40761.2   Max.   :22500.0        Max.   :47137.2  
##                                                            
##  PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY
##  Min.   :0.00000     Min.   :0.00000           
##  1st Qu.:0.08333     1st Qu.:0.00000           
##  Median :0.50000     Median :0.08333           
##  Mean   :0.49035     Mean   :0.20246           
##  3rd Qu.:0.91667     3rd Qu.:0.30000           
##  Max.   :1.00000     Max.   :1.00000           
##                                                
##  PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX 
##  Min.   :0.0000                   Min.   :0.0000         Min.   :  0.000  
##  1st Qu.:0.0000                   1st Qu.:0.0000         1st Qu.:  0.000  
##  Median :0.1667                   Median :0.0000         Median :  0.000  
##  Mean   :0.3644                   Mean   :0.1351         Mean   :  3.249  
##  3rd Qu.:0.7500                   3rd Qu.:0.2222         3rd Qu.:  4.000  
##  Max.   :1.0000                   Max.   :1.5000         Max.   :123.000  
##                                                                           
##  PURCHASES_TRX     CREDIT_LIMIT      PAYMENTS       MINIMUM_PAYMENTS  
##  Min.   :  0.00   Min.   :   50   Min.   :    0.0   Min.   :    0.02  
##  1st Qu.:  1.00   1st Qu.: 1600   1st Qu.:  383.3   1st Qu.:  169.12  
##  Median :  7.00   Median : 3000   Median :  856.9   Median :  312.34  
##  Mean   : 14.71   Mean   : 4494   Mean   : 1733.1   Mean   :  864.21  
##  3rd Qu.: 17.00   3rd Qu.: 6500   3rd Qu.: 1901.1   3rd Qu.:  825.49  
##  Max.   :358.00   Max.   :30000   Max.   :50721.5   Max.   :76406.21  
##                   NA's   :1                         NA's   :313       
##  PRC_FULL_PAYMENT     TENURE     
##  Min.   :0.0000   Min.   : 6.00  
##  1st Qu.:0.0000   1st Qu.:12.00  
##  Median :0.0000   Median :12.00  
##  Mean   :0.1537   Mean   :11.52  
##  3rd Qu.:0.1429   3rd Qu.:12.00  
##  Max.   :1.0000   Max.   :12.00  
## 
## [1] 314

Where is my missing NA values?

I know where is my missing values, so I would like to replace that values with 0.

##    CUST_ID             BALANCE        BALANCE_FREQUENCY   PURCHASES       
##  Length:8950        Min.   :    0.0   Min.   :0.0000    Min.   :    0.00  
##  Class :character   1st Qu.:  128.3   1st Qu.:0.8889    1st Qu.:   39.63  
##  Mode  :character   Median :  873.4   Median :1.0000    Median :  361.28  
##                     Mean   : 1564.5   Mean   :0.8773    Mean   : 1003.20  
##                     3rd Qu.: 2054.1   3rd Qu.:1.0000    3rd Qu.: 1110.13  
##                     Max.   :19043.1   Max.   :1.0000    Max.   :49039.57  
##                                                                           
##  ONEOFF_PURCHASES  INSTALLMENTS_PURCHASES  CASH_ADVANCE    
##  Min.   :    0.0   Min.   :    0.0        Min.   :    0.0  
##  1st Qu.:    0.0   1st Qu.:    0.0        1st Qu.:    0.0  
##  Median :   38.0   Median :   89.0        Median :    0.0  
##  Mean   :  592.4   Mean   :  411.1        Mean   :  978.9  
##  3rd Qu.:  577.4   3rd Qu.:  468.6        3rd Qu.: 1113.8  
##  Max.   :40761.2   Max.   :22500.0        Max.   :47137.2  
##                                                            
##  PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY
##  Min.   :0.00000     Min.   :0.00000           
##  1st Qu.:0.08333     1st Qu.:0.00000           
##  Median :0.50000     Median :0.08333           
##  Mean   :0.49035     Mean   :0.20246           
##  3rd Qu.:0.91667     3rd Qu.:0.30000           
##  Max.   :1.00000     Max.   :1.00000           
##                                                
##  PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX 
##  Min.   :0.0000                   Min.   :0.0000         Min.   :  0.000  
##  1st Qu.:0.0000                   1st Qu.:0.0000         1st Qu.:  0.000  
##  Median :0.1667                   Median :0.0000         Median :  0.000  
##  Mean   :0.3644                   Mean   :0.1351         Mean   :  3.249  
##  3rd Qu.:0.7500                   3rd Qu.:0.2222         3rd Qu.:  4.000  
##  Max.   :1.0000                   Max.   :1.5000         Max.   :123.000  
##                                                                           
##  PURCHASES_TRX     CREDIT_LIMIT      PAYMENTS       MINIMUM_PAYMENTS 
##  Min.   :  0.00   Min.   :   50   Min.   :    0.0   Min.   :    0.0  
##  1st Qu.:  1.00   1st Qu.: 1600   1st Qu.:  383.3   1st Qu.:  163.0  
##  Median :  7.00   Median : 3000   Median :  856.9   Median :  289.6  
##  Mean   : 14.71   Mean   : 4494   Mean   : 1733.1   Mean   :  834.0  
##  3rd Qu.: 17.00   3rd Qu.: 6500   3rd Qu.: 1901.1   3rd Qu.:  788.7  
##  Max.   :358.00   Max.   :30000   Max.   :50721.5   Max.   :76406.2  
##                   NA's   :1                                          
##  PRC_FULL_PAYMENT     TENURE     
##  Min.   :0.0000   Min.   : 6.00  
##  1st Qu.:0.0000   1st Qu.:12.00  
##  Median :0.0000   Median :12.00  
##  Mean   :0.1537   Mean   :11.52  
##  3rd Qu.:0.1429   3rd Qu.:12.00  
##  Max.   :1.0000   Max.   :12.00  
## 

As we see we also see 1 NA value on CREDIT LIMIT. As we know from banking system, CREDIT LIMIT can not be 0, I would like to give a mean value to this NA.

##    CUST_ID             BALANCE        BALANCE_FREQUENCY   PURCHASES       
##  Length:8950        Min.   :    0.0   Min.   :0.0000    Min.   :    0.00  
##  Class :character   1st Qu.:  128.3   1st Qu.:0.8889    1st Qu.:   39.63  
##  Mode  :character   Median :  873.4   Median :1.0000    Median :  361.28  
##                     Mean   : 1564.5   Mean   :0.8773    Mean   : 1003.20  
##                     3rd Qu.: 2054.1   3rd Qu.:1.0000    3rd Qu.: 1110.13  
##                     Max.   :19043.1   Max.   :1.0000    Max.   :49039.57  
##  ONEOFF_PURCHASES  INSTALLMENTS_PURCHASES  CASH_ADVANCE    
##  Min.   :    0.0   Min.   :    0.0        Min.   :    0.0  
##  1st Qu.:    0.0   1st Qu.:    0.0        1st Qu.:    0.0  
##  Median :   38.0   Median :   89.0        Median :    0.0  
##  Mean   :  592.4   Mean   :  411.1        Mean   :  978.9  
##  3rd Qu.:  577.4   3rd Qu.:  468.6        3rd Qu.: 1113.8  
##  Max.   :40761.2   Max.   :22500.0        Max.   :47137.2  
##  PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY
##  Min.   :0.00000     Min.   :0.00000           
##  1st Qu.:0.08333     1st Qu.:0.00000           
##  Median :0.50000     Median :0.08333           
##  Mean   :0.49035     Mean   :0.20246           
##  3rd Qu.:0.91667     3rd Qu.:0.30000           
##  Max.   :1.00000     Max.   :1.00000           
##  PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX 
##  Min.   :0.0000                   Min.   :0.0000         Min.   :  0.000  
##  1st Qu.:0.0000                   1st Qu.:0.0000         1st Qu.:  0.000  
##  Median :0.1667                   Median :0.0000         Median :  0.000  
##  Mean   :0.3644                   Mean   :0.1351         Mean   :  3.249  
##  3rd Qu.:0.7500                   3rd Qu.:0.2222         3rd Qu.:  4.000  
##  Max.   :1.0000                   Max.   :1.5000         Max.   :123.000  
##  PURCHASES_TRX     CREDIT_LIMIT      PAYMENTS       MINIMUM_PAYMENTS 
##  Min.   :  0.00   Min.   :   50   Min.   :    0.0   Min.   :    0.0  
##  1st Qu.:  1.00   1st Qu.: 1600   1st Qu.:  383.3   1st Qu.:  163.0  
##  Median :  7.00   Median : 3000   Median :  856.9   Median :  289.6  
##  Mean   : 14.71   Mean   : 4494   Mean   : 1733.1   Mean   :  834.0  
##  3rd Qu.: 17.00   3rd Qu.: 6500   3rd Qu.: 1901.1   3rd Qu.:  788.7  
##  Max.   :358.00   Max.   :30000   Max.   :50721.5   Max.   :76406.2  
##  PRC_FULL_PAYMENT     TENURE     
##  Min.   :0.0000   Min.   : 6.00  
##  1st Qu.:0.0000   1st Qu.:12.00  
##  Median :0.0000   Median :12.00  
##  Mean   :0.1537   Mean   :11.52  
##  3rd Qu.:0.1429   3rd Qu.:12.00  
##  Max.   :1.0000   Max.   :12.00

Re-checking my values again for NA detection.

There is no missing values on my dataset. Yey!

## 'data.frame':    8950 obs. of  17 variables:
##  $ BALANCE                         : num  40.9 3202.5 2495.1 1666.7 817.7 ...
##  $ BALANCE_FREQUENCY               : num  0.818 0.909 1 0.636 1 ...
##  $ PURCHASES                       : num  95.4 0 773.2 1499 16 ...
##  $ ONEOFF_PURCHASES                : num  0 0 773 1499 16 ...
##  $ INSTALLMENTS_PURCHASES          : num  95.4 0 0 0 0 ...
##  $ CASH_ADVANCE                    : num  0 6443 0 206 0 ...
##  $ PURCHASES_FREQUENCY             : num  0.1667 0 1 0.0833 0.0833 ...
##  $ ONEOFF_PURCHASES_FREQUENCY      : num  0 0 1 0.0833 0.0833 ...
##  $ PURCHASES_INSTALLMENTS_FREQUENCY: num  0.0833 0 0 0 0 ...
##  $ CASH_ADVANCE_FREQUENCY          : num  0 0.25 0 0.0833 0 ...
##  $ CASH_ADVANCE_TRX                : int  0 4 0 1 0 0 0 0 0 0 ...
##  $ PURCHASES_TRX                   : int  2 0 12 1 1 8 64 12 5 3 ...
##  $ CREDIT_LIMIT                    : num  1000 7000 7500 7500 1200 1800 13500 2300 7000 11000 ...
##  $ PAYMENTS                        : num  202 4103 622 0 678 ...
##  $ MINIMUM_PAYMENTS                : num  140 1072 627 0 245 ...
##  $ PRC_FULL_PAYMENT                : num  0 0.222 0 0 0 ...
##  $ TENURE                          : int  12 12 12 12 12 12 12 12 12 12 ...

## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(freq)` instead of `freq` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Warning: Ignoring unknown parameters: binwidth, bins, pad

PCA Analysis

As shown graph above, 10 components cover nearly 80% of variance according to PCA analysis.

4 KMeans Clustering

Kmeans is a simple unsupervised learning algorithm to cluster dataset into groups. Even if the cluster number is not optimal for the dataset, the algorithm splits into the given number of groups. It can be got different results even if the same code runs, because of changing the starting point of the model. Therefore randomization is the essential point of the Kmeans algorithm as the same as other algorithms.

We can start four centers for our model, and let’s look at the results:

##      BALANCE BALANCE_FREQUENCY  PURCHASES ONEOFF_PURCHASES
## 1 -0.3186789         0.2474748  0.1129728      0.004276753
## 2 -0.2659988        -0.3707411 -0.3427854     -0.230659248
## 3  1.4592898         0.3849408 -0.2350014     -0.164145135
## 4  0.9626347         0.4634767  3.1639402      2.748625306
##   INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## 1              0.2590371   -0.3649444           0.9868072
## 2             -0.3865510   -0.1829484          -0.7939753
## 3             -0.2541801    1.6872281          -0.5064753
## 4              2.4315805   -0.1559175           1.1341286
##   ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 1                  0.3233029                        0.8767322
## 2                 -0.3897275                       -0.7102531
## 3                 -0.2137049                       -0.4515378
## 4                  1.8033921                        1.0604485
##   CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT
## 1             -0.4607102       -0.3591435     0.3036958  -0.07308479
## 2             -0.1035736       -0.1658335    -0.4740003  -0.33458515
## 3              1.7485690        1.6192911    -0.2840458   0.83870197
## 4             -0.3254707       -0.1720256     3.0161122   1.43770090
##     PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT      TENURE
## 1 -0.1357102      -0.08443574        0.3948791  0.06049096
## 2 -0.2623140      -0.12819077       -0.2582916 -0.05424328
## 3  0.6032523       0.49933830       -0.4062667 -0.09752305
## 4  1.9468103       0.49277411        0.4704433  0.32339259

The figure shown above plotted by two components which cover 47.62% of the point variability.

## [1] 3363 3989 1197  401

There is a problem that must be solved because the distribution of the size between clusters seems unbalanced. There may be several reasons for this, we could start with control the number of clusters whether the cluster number is true.

There are different methods to find the optimal cluster number in the Kmeans algorithm. We focus on them in the next section to find the best cluster number.

4.1 Determining Optimal Clusters

There are several popular methods to find optimal clusters. We applied two methods listed below:

  • Elbow Method
  • Silhouette Method

They are explained with their examples respectively below.

4.1.1 Elbow Method

In the Elbow method, it is plotted a line chart using within-groups the sum of squares of the model which is applied the Kmeans algorithm with the range of different k values. Because a line plot seems like an arm, this method is named Elbow Method. SSE is decreasing as increasing k value because the distance between clusters decreases. Our goal is to choose k value corresponding to the “Elbow point” in the chart. This point is also the minimum SSE value for our model.

According to the Elbow Method, the elbow point of the model seems seven cluster is the best choice.

4.1.2 Silhouette Method

In the Silhouette method, it is plotted a line chart like an Elbow Method using the range of different k values. While the Elbow Method tries to calculate the error between the clusters, the Silhouette Method tries to estimate the average distance between clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters. therefore, our goal is to decide k value corresponding to a higher average distance between clusters.

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 447500)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 447500)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 447500)

With a different visualization technique for Silhouette Method:

The seven clusters seems the optimal number for our model.

5 Results

It seems seven clusters are good for our model.

##   Group.1   BALANCE BALANCE_FREQUENCY  PURCHASES ONEOFF_PURCHASES
## 1       1  112.5092         0.3576877   306.9341         186.1401
## 2       2 1539.7193         0.9766219  1495.1614         781.7478
## 3       3 2976.6239         0.9503170   222.1891         164.9473
## 4       4 1247.6485         0.9661524   290.8402         230.5123
## 5       5  137.1797         0.9108183  1346.9436         656.8064
## 6       6 4151.3604         0.9844211 10046.3581        6738.1296
## 7       7 6103.8708         0.9734736   865.3894         520.8899
##   INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## 1              121.03585    332.59923           0.2659229
## 2              713.63388    352.71995           0.9041611
## 3               57.27007   2697.05845           0.1596098
## 4               60.75994    449.08079           0.1843836
## 5              690.24556     48.36178           0.8648975
## 6             3310.79265    667.50401           0.9414595
## 7              344.66398   7172.18706           0.4116654
##   ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 1                 0.06771059                       0.18979894
## 2                 0.34916135                       0.73319274
## 3                 0.08581231                       0.07719388
## 4                 0.10001614                       0.08261485
## 5                 0.29078703                       0.67504441
## 6                 0.75859883                       0.78802772
## 7                 0.19474781                       0.30083949
##   CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT
## 1            0.038265000        0.7522412      4.022005     3616.470
## 2            0.064245784        1.2518519     27.009150     4501.043
## 3            0.418021648        9.6332518      3.016300     5154.418
## 4            0.099665131        1.7315191      3.277823     3016.450
## 5            0.008449436        0.1438980     20.985428     4743.488
## 6            0.068019923        1.9615385    104.559829    10718.590
## 7            0.569196821       21.0835381     13.395577     9804.791
##    PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT   TENURE
## 1 1058.4390         125.3299       0.18090335 11.28851
## 2 1584.6602         984.7496       0.05019256 11.74205
## 3 1769.4853        1177.4802       0.03052005 10.95355
## 4  846.0003         694.3229       0.02345691 11.70024
## 5 1443.8112         179.4812       0.77294468 11.38525
## 6 9475.3428        2379.7085       0.31674990 11.95299
## 7 6190.6427        2806.5320       0.05055040 11.63882
## [1] 1227 2295 1227 2462 1098  234  407

## [1] 1227 2295 1227 2462 1098  234  407

6 Summary

The aim was to segment customers to define a marketing strategy. Below you can see the user groups that we have segmented according to the clusters we have reserved.

7 References

[1] https://www.datanovia.com/en/lessons/cluster-validation-statistics-must-know-methods/

[2] https://towardsdatascience.com/10-tips-for-choosing-the-optimal-number-of-clusters-277e93d72d92